Quantization Survey
This page contains my reading notes on
Problem:
Given a full-precision number x, which is either a weight or an activation in the network, we want to replace x at inference time with a value \hat{x} drawn from only 2^{k} distinct values.
1. k is called the bit-width.
1. The goal is to reduce inference time and memory usage: fewer distinct values need fewer bits to store, and the computation can benefit from integer arithmetic hardware.
1. Unlike quantization in signal processing, whose primary goal is to minimize the difference between the quantized values and the full-precision values, quantization in neural networks aims to minimize the accuracy drop, which can be achieved even if the average difference is large.
Uniform Quantization
- Uniform quantization means that the values after the quantization are equally spaced: \hat{x}_{n} - \hat{x}_{n - 1} = \hat{x}_{n - 1} - \hat{x}_{n - 2}
- A widely used method of uniform quantization is as follows:
- The quantization operator maps real values to a set of consecutive integers in the range [-2^{k-1}, 2^{k-1} - 1]: Q(x) = \mathrm{round}(\frac{x - b}{s}), where s is the scaling factor, b is the bias, and \mathrm{round}() rounds a float to the nearest integer.
- Both s and b can be calculated directly once we have selected a range [\alpha, \beta] for x: s = \frac{\beta - \alpha}{2^{k} - 1}, b = \frac{\alpha + \beta}{2}. The scaling factor divides the range [\alpha, \beta] into 2^{k} - 1 equal-width steps, which gives 2^{k} equally spaced levels. The bias is the midpoint of the range, so subtracting it shifts the selected range to be zero centered.
- Finally, the quantized value \hat{x} used at inference time is recovered from Q(x): \hat{x} = sQ(x) + b
- Symmetric and asymmetric quantization
- If the selected range [\alpha, \beta] is symmetric around 0, i.e. \alpha = -\beta, the quantization is called symmetric. Otherwise, it is called asymmetric.
- Symmetric quantization doesn't require b (b = 0) since the selected range is already zero centered. However, it can leave some quantized values unused or over-used if the distribution of x is not symmetric.
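
The scheme above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the helper name `make_quantizer`, the sample inputs, and the clamping of Q(x) to [-2^{k-1}, 2^{k-1} - 1] are all additions for the example. The bias is taken to be the midpoint of [\alpha, \beta], so that subtracting it zero-centers the range. Note that Python's built-in `round` rounds halves to the nearest even integer.

```python
def make_quantizer(alpha, beta, k):
    """Build quantize/dequantize functions for the range [alpha, beta] at bit-width k."""
    s = (beta - alpha) / (2 ** k - 1)   # scaling factor: spacing between adjacent levels
    b = (alpha + beta) / 2              # bias: midpoint of the range, so x - b is zero centered
    q_min, q_max = -2 ** (k - 1), 2 ** (k - 1) - 1

    def quantize(x):
        q = round((x - b) / s)            # Q(x) = round((x - b) / s)
        return max(q_min, min(q_max, q))  # clamping is an assumption added for this sketch

    def dequantize(q):
        return s * q + b                  # x_hat = s * Q(x) + b

    return quantize, dequantize


# Hypothetical non-negative inputs (for example, post-ReLU activations) in [0, 6].
xs = [i * 0.05 for i in range(121)]
k = 4

# Asymmetric: the selected range matches the data, [alpha, beta] = [0, 6].
q_asym, dq_asym = make_quantizer(0.0, 6.0, k)
# Symmetric: the selected range is forced to [-6, 6], so b = 0.
q_sym, dq_sym = make_quantizer(-6.0, 6.0, k)

# Count how many of the 2^k integer codes each scheme actually uses.
asym_codes = {q_asym(x) for x in xs}
sym_codes = {q_sym(x) for x in xs}
print("asymmetric codes used:", len(asym_codes), "of", 2 ** k)
print("symmetric codes used: ", len(sym_codes), "of", 2 ** k)

# Worst-case reconstruction error |x_hat - x| over the sample inputs.
asym_err = max(abs(dq_asym(q_asym(x)) - x) for x in xs)
sym_err = max(abs(dq_sym(q_sym(x)) - x) for x in xs)
print("max error:", asym_err, "(asymmetric) vs", sym_err, "(symmetric)")
```

Because the sample inputs are never negative, the symmetric range spends its negative codes on values that do not occur, so its step size is twice as large as in the asymmetric case and the reconstruction error grows accordingly.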